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Abstract 

Markov jump processes (MJPs) are used to 
model a wide range of phenomena from dis¬ 
ease progression to RNA path folding. How¬ 
ever, maximum likelihood estimation of para¬ 
metric models leads to degenerate trajectories 
and inferential performance is poor in nonpara- 
metric models. We take a small-variance asymp¬ 
totics (SVA) approach to overcome these limita¬ 
tions. We derive the small-variance asymptotics 
for parametric and nonparametric MJPs for both 
directly observed and hidden state models. In 
the parametric case we obtain a novel objective 
function which leads to non-degenerate trajecto¬ 
ries. To derive the nonparametric version we in¬ 
troduce the gamma-gamma process, a novel ex¬ 
tension to the gamma-exponential process. We 
propose algorithms for each of these formula¬ 
tions, which we call JUMP-means. Our experi¬ 
ments demonstrate that JUMP-means is compet¬ 
itive with or outperforms widely used MJP in¬ 
ference approaches in terms of both speed and 
reconstruction accuracy. 


1. Introduction 


gression (Mandel 2010) and RNA path folding (Hajiaghayi 
et al.[ |2014j), or when the state is only observed indirectly, 


as in corporate bond rating (Bladt & Sprerisen 2009). For 
example, consider the important clinical task of analyzing 
physiological signals of a patient in order to detect abnor¬ 
malities. Such signals include heart rate, blood pressure, 
respiration, and blood oxygen level. For an ICU patient, 
an abnormal state might be the precursor to a cardiac ar¬ 
rest event while for an epileptic, the state might presage a 
seizure (Goldberge r et al.| 2000). How can the latent state 
of the patient be inferred by a Bayesian modeler, so that, 
for example, an attending nurse can be notified when a pa¬ 
tient enters an abnormal state? MJPs offer one attractive 
approach to analyzing such physiological signals. 

Applying an MJP model to physiological signals presents 
a challenge: the number of states is unknown and must 
be inferred using, for example, Bayesian nonparametric 
methods. However, efficient inference in nonparamet¬ 
ric MJP models is a challenging problem, where exist¬ 
ing methods based on particle MCMC scale poorly and 
mix slowly (Saeedi & Bouchard-Cote 2011 j ). Current 
optimization-based methods such as expectation maxi¬ 
mization (EM) are inapplicable if the state size is countably 
infinite; hence, they cannot be applied to Bayesian non¬ 
parametric MJP models, as we would like to do for physi¬ 
ological signals. 


Markov jump processes (MJPs) are continuous-time, 
discrete-state Markov processes in which state durations 
are exponentially distributed according to state-specific 
rate parameters. A stochastic matrix controls the probabil¬ 
ity of transitioning between pairs of states. MJPs have been 
used to construct probabilistic models either when the state 
of a system is observed directly, such as with disease pro- 
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Furthermore, although MJPs are viewed as more realistic 


than their discrete-time counterparts in many fields (Rao & 


Teh 2013), degenerate solutions for the maximum likeli¬ 


hood (ML) trajectories for both directly and indirectly ob¬ 
served cases ( jPerkins[|2009j l, and non-existence of the ML 
transition matrix (obtained from EM) for some indirectly 
observed cases (Bladt & Sprensen 2009) present inferen¬ 
tial challenges. Degenerate ML trajectories occur when 
some of the jump times are infinitesimal, which severely 
undermines the practicality of such approaches. For in¬ 
stance, a trajectory which predicts a patient’s seizure for an 
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Notation 

M : number of states 

7 r: initial state distribution 

P: state transition matrix, with entries p ss ' 

\ s : transition rate for state s 

U = (s 0 ,t 0 , , s K -i,t K -i, s K )' MJPtrajectory 

S : the states corresponding to U 
T : the times corresponding to U 

O = { (t,. Si)}: observation times and states of DOMJP* 

t = (n,.. ., tl): observation times of HMJP 

X = {x \,..., Xl)- observations of HMJP 

p sn : probability of observing xi = n when in state s 

3) and two possible observation values 
(so, to, si, ti, S2, G). Right: Notation used for 


Figure 1. Left: Illustrative example for an HMJP (Section [3.2| ) with three hidden states (M 
(N = 2). The observations X, their times r, an (arbitrary) sample MJP trajectory U 
parametric MJPs. *DOMJP = directly observed MJP. 


infinitesimal amount of time is of limited use to the medical 
staff. Fig.[3]shows an example of the degeneracy problem. 

In this paper, we take a small-variance asymptotics (SVA) 
approach to develop an optimization-based framework for 
efficiently estimating the most probable trajectories (states) 
for both parametric and nonparametric MJP-based mod¬ 
els. Small-variance asymptotics has recently proven to be 
useful in estimating the parameters and inferring the latent 
states in rich probabilistic models. SVA extends the well- 
known connection between mixtures of Gaussians and k- 
means: as the variances of the Gaussians approach zero, 
the maximum a posteriori solution to the mixture of Gaus¬ 


sians model degenerates to fc-means solution (Kulis & Jor- 
|dan| |2012| >. The same idea can be applied to obtain well- 
motivated objective functions that correspond to a latent 
variable model for which scalable inference via standard 
methods like MCMC is challenging. SVA has been applied 
to (hierarchical) Dirichlet process mixture models (Kulis 
|& Jordan} |2012||Jiang et ak 2012[>, Bayesia n nonparamet- 
ric latent feature models (Broder ick et al.| |201 3| >, hidden 
Markov models (HMMs), and infinite-state HMMs (Roy 
chowdhury et al. 2013[>. 


We apply the SVA approach to both parametric and 
Bayesian nonparametric MJP models to obtain what we 
call the JUMP-means objective functions. In the paramet¬ 
ric case, we derive a novel objective function which does 
not suffer from maximum likelihood’s solution degeneracy, 
leading to more stable and robust inference procedures in 
both the directly observed and hidden state cases. Infinite- 
state MJPs (iMJPs) are constructed from the hierarchical 


gamma-exponential process (HIEP) (Saeedi & Bouchard 
Cote 2011(>. In order to apply SVA to iMJPs, we generalize 


the HI EP to obtain the first deterministic procedure (we 
know of) for inference in Bayesian nonparametric MJPs. 


We evaluate JUMP-means on several synthetic and real- 
world datasets in both the parametric and Bayesian non¬ 


parametric cases. JUMP-means performs on par with or 
better than existing methods, offering an attractive speed- 
accuracy tradeoff. We obtain significant improvements in 
the non-parametric case, gaining up to a 20% reduction in 
mean error on the task of observation reconstruction. In 
summary, the JUMP-means approach leads to algorithms 
that 1) are applicable to MJPs with Bayesian nonparamet¬ 
ric priors; 2) provide non-degenerate solutions for the most 
probable trajectories; and 3) are comparable to or outper¬ 
form other standard methods of inference both in terms of 
speed and reconstruction accuracy. 

2. Background 

2.1. Markov Jump Processes 

A Markov jump process (MJP) is defined by (a) a finite (or 
countable) state space, which we identify with the integers 
[M] = {1,..., M}; (b) an initial state probability distri¬ 
bution 7r; (c) a (stochastic) state transition matrix P with 
p ss = 0 for all s £ [M]; and (d) a state dwell-time rate 
vector A = (Ai,..., Am)- The process begins in a state 
so ~ 7T. When the process enters a state s, it remains there 
for a dwell time that is exponentially distributed with pa¬ 
rameter X s . When the system leaves state s, it transitions 
to state s' ^ s with probability p ss ' ■ 

A trajectory of the MJP is a sequence of states and a 
dwell time for each state, except for the final state: U = 
U T = (s 0 ,t 0 ,si,t 1 ,...,s K _- L ,t K _ 1 ,s K ). Implicitly, K 
(and thus U) is a random variable such that tx-i < T 
and the system is in state sk at time T. Let S = St = 
(s 0 ,si,...,sir) andT = Tt = (to, h ,. . ., t K -i) be the 
sequences of states and times corresponding to U. The 
probability of a trajectory is given by 

p(U\n,P,\) = l[t.<T]e- x ^(T-t.) nso (1) 

x nf=i A Sfc _ 1 e _Aa '=-i tfc - 1 p S)i _ lSfc , 
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where t. = Y^k=o anc * ^[‘] * s the indicator function. 
In many cases when the states are directly observed, the 
initial state and the final state are observed, in which case 
it is straightforward to obtain a likelihood from ([TJ. 

A hidden state MJP (HMJP) is an MJP in which the states 
are observed indirectly according to a likelihood model 
p(x | s), s £ [M], x £ X, where X is some observation 
space. The times of the observations r = 
are chosen independent of U , so the probability of the ob¬ 
servations A” = (ati, ... ,xl) is given by p(X \U,t) = 
Tlfci p( x f. I SrJ; where, with an abuse of notation, we 
write s T for the state of the MJP at time r. 

2.2. Previous Approaches to MJP Inference 

There are a number of existing approaches to inference and 
learning in MJPs. An expectation-maximization (EM) al¬ 
gorithm can be derived, but it cannot be applied to mod¬ 
els with countably infinite states, so it is not suitable for 
iMJPs (Lange 2014) (iMJPs are detailed in Section [4). 
Moreover, with discretely observed data, the maximum- 
likelihood estimate with finite entries for the transition ma¬ 


trix obtained from EM may not exist (Bladt & Sprerisen 
|2005l >. 

Maximum likelihood inference amounts to finding 
maxu Inp(U \ tt, P, \), which can be carried out effi- 

2009) . 


ciently using dynamic programming (Perkins 


However, maximum likelihood solutions for the trajectory 
are degenerate: only an infinitesimal amount of time is 
spent in each state, except for the state visited with the 
smallest rate parameter (i.e., longest expected dwell time). 
Such a solution is unsatisfying and unintuitive because 
the dwell times are far from their expected values. Thus, 
maximum likelihood inference produces results that are 
unrepresentative of the model behavior. 

Markov chain Monte Carlo methods have also been devel¬ 
oped, but these can be slow and their convergence is often 
difficult to diagnose (Rao & Teh 2013) . Recently, a more 
efficient Monte Carlo method was proposed in Hajiaghayi 
et al. (2014) which is based on particle MCMC (PMCMC) 


methods (Andrieu et al.| 2010). This approach addresses 


the issue of efficiency, but since it marginalizes over the 
jump points, it cannot provide probable trajectories. 

2.3. Small-variance Asymptotics 

Consider a Bayesian model p(D \ Z, 9 , cr 2 )p(Z, 9) in which 
the likelihood terms contain a variance parameter a 2 . 
Given some data D, a point estimate for the param¬ 
eters 9 and latent variables Z of the model can ob¬ 
tained by maximizing the posterior p(Z, 9\D,a 2 ) ex 
p(D | Z, 9 , cr)p(Z , 9), resulting in a maximum a posteriori 


(MAP) estimate. In the SVA approach (Broderick et al. 


2013) , the MAP optimization is considered in the limit as 
the likelihood variance parameter is taken to zero: rr 2 —> 0. 
Typically, the small-variance limit leads to a much sim¬ 
pler optimization than the MAP optimization with non-zero 
variance. For example, the MAP objective for a Gaussian 
mixture model simplifies to the fc-means objective. 

3. Parametric MJPs 

3.1. Directly Observed MJP 

Consider the task of inferring likely state/dwell-time se¬ 
quences given O = {(tj, Sj)}( =1 , the times at which the 
system was directly observed and the states of the system 
at those times. For simplicity we assume that f 0 = 0 and 
that all times are in the interval [0, T]. Let s(U,t) be the 
state of the system following trajectory U at time t. The 
likelihood of a sequence is 


m I O, P, A) = 1 [t.< T } ti) = Si] 


i=l 


K 


( 2 ) 


n As 

S.fc=l 


—X. 




Ps k -1 


0 -Xs K (T-t.) 


We also place a gamma prior on the rate parameters A 
(detailed below). Instead of relying on MAP estimation, 
we apply a small variance asymptotics analysis to obtain 
more stable objective function. Following (Jiang et al. 


2012 1 , we scale the distributions by an inverse variance 
parameter /3 and then maximize the scaled likelihood and 
prior in the limit p —> oo (i.e., as the variance goes to zero). 

Scaling the exponential distribution f(t; A) = Aexp(—At) 
produces the two-parameter family with 


ln/(f; A,/3) = 

— P ( Xt — In t — In A — 


P ln/3 — lnr(/3) Inf 


P 


+ 


P 


which is the density of a gamma distribution with shape 
parameter p and rate parameter P A. Hence, the mean of 
the scaled distribution is A and its variance is Letting 
F(t; A, P) denote the CDF corresponding to /(t; A, /?), we 
have 1 — F(t; A, p) = , where T(-, •) is the upper 

incomplete gamma function. 

The multinomial distribution is scaled by the parameter 
P = £P. Writing the likelihood with the scaled exponential 
families (and dropping indicator variables) yields: 


t(U\0,P,X) 


K -1 


P 


( 3 ) 


+ ^ ] (£l n PsfcSfc + i H ~ ^Sk^k 


k=0 
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y- 1 ( /31n/3 — lnT(/3) + \nt k \\\ 

k =0 ' P P ' / J 


3 . 2 . Hidden State MJP 

For an HMJP, the likelihood of a valid trajectory is 


The modified likelihood is for a jump process which is 
no longer Markov when fj / 1. We also place a 

Gam(aA, a.\p\) prior on each A* and set a\ = £\/3. It 
can be shown (see the Supplementary Material for details) 
that, when (3 — > oo, the MAP estimation problem becomes 


p{U | X,t,P, A) = ( | s Tt ) 


i =i 


K 


(5) 


nw- v '‘-v« f- 1 - 1 '-' 1 . 


\k =1 


^ X-l K-l 

I ^ E lnp SfcSfc+ i + 'y ' (X Sk t k luA S)s tfc 1) 

’ ’ fc=o k =o 

+ 1[A SK t. > l](X BK t.-hiX BK t.-l) (4) 

M ^ 

+ €\'^2{p\X s — InA s -1)1. 

S=1 ' 

The optimization problem ( |A.4| > is very natural and of¬ 
fers far greater stability than maximum likelihood opti¬ 
mization. As with maximum likelihood, the lnp SfeSfc+1 
terms penalize transitions with small probability. The term 
h(t k ) = A Sk t k — In A Sk tk — 1 is convex and minimized 
when t k = l/A Sfe , the expected value of the dwell time 
for state s k . As t k -A 0, h(t k ) approaches oo, while for 
tk ^ l/A Sfc , h(tk ) grows approximately linearly. Thus, 
times very close to zero are heavily penalized while times 
close to the expected dwell time are penalized very little. 
The term 1[A SK t. > l](A Sjc i. — lnA Sjf f. — 1) penalizes 
the time t. spent in state s k so far in the same manner as 
a regular dwell time when t. is greater than the expected 
value of the dwell-time. However, when t. is less than the 
expected value there is no cost, which is quite natural since 
the system may remain in state s k for longer than t. — i.e., 
there should not be a large penalty for t. being less than its 
expected value. Finally, parameters and px have a very 
natural interpretation (cf. ([8) below): they correspond to a 
priori having £> dwell times of length p\ for each state. 


Hence, the only difference between the directly observed 
case and the HMJP is the addition of the observation like¬ 
lihood terms. Because multinomial observations are com¬ 
monly used in MJP applications, that is the case we con¬ 
sider here. Let N denote the number of possible observa¬ 
tions and p sn be the probability of observing xe = n when 
s Tf = s. The observation likelihoods are scaled in the same 
manner as the transition probabilities, but with (3 = ^/3. 
Thus, for the HMJP, we obtain: 

, L K -1 

^min C E ln ^*e + £ E ln Pe k s k+ 1 
’ ’ ’ p ^ e=i k -o 

K—X 

+ E i^ktk - In A Sk t k - 1) (6) 

k—0 

+ l[Ag K f. > l](A SK f. - In X SK t. - 1) 

m >. 

+ Ca^(maA s - In A s - 1) 

s=l J 

3.3. Algorithm 

Optimizing the JUMP-means objectives in ( |A.4| ) and ([6} is 
non-trivial due to the fact that we do not know the number 
of jumps in the MJP, and the combinatorial explosion in 
the sequences with the number of jump points. The terms 
involving the continuous variables t k (dwell times) present 
an additional complexity. 


Comparison to Maximum Likelihood MJP trajectories 
estimated using maximum likelihood (MLE) are usually 
trivial, with the system spending almost all its time in a 
single state (with the smallest A), with infinitesimal dwell 
times for the other states. This poor behavior of MLE is due 
to the fact that the mode of Exp(A), which is favored by the 
MLE, is 0, even though the mean is 1/AQ The SVA opti¬ 
mization, on the other hand, does give trajectories that are 
representative of the true behavior because the SVA terms 
of the form A t — ln(Af) — 1 are optimized at 1/A (i.e., at the 
mean of Exp(A)). We demonstrate the superior behavior 
of the SVA in the concrete example of estimating disease 
progression in patients in Section[5] 

'Note that placing priors on the rate parameters, as we do, 
does not affect the degeneracy of the ML trajectory. 


We therefore resort to an alternating minimization proce¬ 
dure to optimize the JUMP-means objective function, sim¬ 
ilar in spirit to the one used in Roychowdhury et al. ( 2013j >. 
In each iteration of the optimization process, we first use 
a modified Viterbi algorithm to obtain the most likely state 
sequence. Then, we use convex optimization to distribute 
the jump points optimally with respect to the values from 
A for the current state sequence. 


Directly Observed MJP When optimizing ( |A.4| >, there 
may be many sequences (O’s) available, representing dis¬ 
tinct realizations of the process. We use the following al¬ 
gorithm to optimize (|A.4|): 


1. Initialize the state transition matrix P and rate vector A 
with uniform values. 
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2. For every observation sequence O, instantiate the jump 
points by adding one jump point between every pair of 
observations, in addition to the start and end points. 

3. For each O, use a modified Viterbi algorithm, to find 
the best state sequence to optimize ( |A.4| ), while keeping 
the jump points fixed. The modified algorithm includes 
the dwell time penalty terms, which are dependent upon 
the assignment of states to the time points. 

4. Optimize the dwell times tk with the state sequences of 
the trajectories fixed. 

5. Optimize P and A with the other variables fixed. The 
optimal values can be obtained in closed form. For ex¬ 
ample, if there is only a single observation sequence O 
with corresponding inferred trajectory S, then 


Pmj = m m3 , rn,j G [M\ (7) 

Z^j—l n mj 

\ = ^ + ^k t isk = m] 

txPx + E fe l[sfc = m]i fc ’ W 


where n. m] denotes to the number of transitions from 
state m to state j in S. 

6 . Repeat steps 3-5 until convergence. 


Beam Search Variant We note that the optimization pro¬ 
cedure just described is restrictive since the number of 
jump points is fixed and the jump points are constrained 
by the observation boundaries. To eliminate this, we also 
tested a beam search variant of the algorithm to allow for 
the creation and removal of jump points, but found it did 
not have much impact in our experiments. 


Hidden State MJP The algorithm to optimize the hidden 
state MJP JUMP-means objective ([ 6 ]) is similar to that for 
optimizing ( |A.4| ), but with three modifications. First, in 
place of O, we have the indirect observations of the states 
X. Second, observation likelihood terms containing p are 
included in the objective minimized by the Viterbi opti¬ 
mization (step 3). Finally, an additional update is per¬ 
formed in step 5 for each of the observation distributions 
Pm' 


Pmn — 


= m\t[xt = n\ 
= m] 


(9) 


for m G [M] and n G [N], If each p m = (p ml ,..., p mN ) 
is initialized to be uniform, then the algorithm converges to 
a poor local minimum, so we add a small amount of random 
noise to each uniform p m . 


4. Bayesian Nonparametric MJPs 

We now consider the Bayesian nonparametric MJP (iMJP) 
model. The iMJP is based on the hierarchical gamma- 
exponential process (HrEP), which is constructed from the 


gamma-exponential process (rEP). We denote the Moran 
gamma process with base measure H and rate parameter 
7 by rP(£T, 7 ) (Kingman, 1993). The HrEP generates a 
state/dwell-time sequence so, t\, si, t 2 , S 2 , to, S 3 ,... (with 
so assumed known) according to (|Saeedi & Bouchard-| 

[dkil[20TTT ): 


/z 0 ~ rP(a 0 iJo,7o), (10) 

m\po rp(/i 0 , 7 ), * = 1,2,..., (11) 

( 12 ) 

tk | {pi}iL 0 Mk-l ~ Exp (||// Sfc _ 1 II), (13) 


where Hq is the base probability measure, ao is a concen¬ 
tration parameter, U k = (s 0 , ti, si, t 2 , s 2 ,..., ffc-i, Sfc), 
Pi = /Xj/ 1 |/tj||, and \\p\\ denotes the total mass of the mea¬ 
sure p. As in the parametric case, we must replace the 
exponential distribution in © with the scaled exponen¬ 
tial distribution. After an appropriate scaling of the rest 
of the hyperparameters, we obtain the hierarchical gamma- 
gamma process (HTTP). The definition and properties of 
the HrrP are given in the Supplementary Material. 

Let M denote the number of used states, K m the number of 
transitions out of state m, and p l3 the mass on the j-th com¬ 
ponent of the measure pi. For 0 < i < M, 1 < j < M, let 
7 f ij = jlij and for 0 < i < M, let 7f ijM +i — 1 - Ej=i Pip 
Let C = be the waiting times follow¬ 
ing state m and define t* n . = 1 t* m j ■ l n order to re¬ 

tain the effects of the hyperparameters in the asymptotics, 
set ao = exp(—^i/3) and 70 = «o = ^ 2 - It can then 
be shown that (see the Supplementary Material for details), 
when P — >• 00 , the iMJP MAP estimation problem becomes 


L K 

„ min c ln Ps r e *t + ^ ln 

K,Uk,^,P — 1 c — 


1=1 


k =1 


M 


+ ^M+^^KL(7fo||7f m ) (14) 

m =1 
M 

- - K m In ([7 + t*m]/ K m )} ■ 


Like its parametric counterpart, the Bayesian nonparamet¬ 
ric cost function penalizes dwell durations very close to 
zero via the ln terms. In addition, there are penal¬ 
ties for the number of states and the state transitions. The 
observation likelihood term in ( fl4| > favors the creation of 
new states to minimize the JUMP-means objective, while 
the state penalty £ 1 M and the non-linear penalty term 
K rn ln ([7 + t* n ]/K rn ) counteracts the formation of a long 
tail of states with very few data points. The 7 hyperparam¬ 
eter introduces an additional, nonlinear cost for each addi¬ 
tional state — if a state is occupied for ( 1 ( 7 ) time, then the 
7 term for that state does not have much effect on the cost. 
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Table 1. Statistics and Mean observation reconstruction error for the various models on different datasets. Key: BL = Baseline; P = para¬ 
metric; SVA = JUMP-means; NP = nonparametric; DO = directly observed; H = hidden; MS = multiple sclerosis data set; MIMIC = blood 
pressure data set. *Best result obtained by running EM with various number of hidden states (up to 12). 



Data Set 

# Points 

Held Out 

# States 

BL 

EM 

Mean Error (%) 
MCMC SVA 

PMCMC 

Synthetic 1 (P-DO) 

10,000 

30% 

10 

69.7 

40.2 

41.9 

41.2 

- 

Synthetic 2 (P-H) 

10,000 

30% 

5 

51.8 

42.9 

74.6 

46.5 

- 

MS (P-DO) 

390 

50% 

3 

51.2 

26.2 

48.1 

25.4 

- 

MIMIC (NP-H) 

2,208 

25 % 

- 

42.3 

25.7* 

- 

24.3 

30.9 


The KL divergence terms between ftp and Jf m arise from 
the hierarchical structure of the prior, biasing the transition 
probabilities Jf m to be similar to the prior ftp. 


4.1. Algorithm 

For the iMJP case, we have the extra variables M and 
{^m}m =o to optimize. In addition, the number of variables 
to optimize depends on the number of states in our model. 
The major change in the algorithm from the parametric case 
is that we must propose and then accept or reject the addi¬ 
tion of new states. We propose the following algorithm for 
optimizing the iMJP: 


(1) Initialize p, 7To and 7ti with uniform values and set the 
number of states M = 1. 

(2) For each observation sequence, apply the Viterbi al¬ 
gorithm and update the times using the new objective 
function in ( fl4| ), analogously to steps (3) and (4) in the 
parametric algorithm. 

(3) Perform MAP updates for p (as in |9|) and ff: 


“b £2^0 j 

Tmj ~ ^ c _ > 
S Zjj=1 n mj + S2 7r Oj 

M 

%«n 

m= 1 


m,j £ [M] 


(15) 

(16) 


(4) For every state pair m,m' £ [M], form a new state 
M + 1 by considering all transitions from m to m! and 
reassigning all observations xt that were assigned to 
to' to the new state. Update if and p to estimate the 
overall objective function for every new set of M + 1 
states formed in this way and accept the state set that 
minimizes the objective. If no such set exists, do not 
create a new state and revert back to the old if and p. 


(5) Repeat steps 2-4 until convergence. 


Remark. If instead of multinomial observations we have 
Gaussian observations, the parameter p s is replaced with 
the mean parameter \i s . In this case, we update the mean 
for each state using the data points assigned to the state, 
similar to the procedure for /.-means clustering (see, e.g., 
Jiang et al. ( 2012) l; |Roychowdhury et ah ( 2013| ). 


5. Experiments 

In this section we provide a quantitative analysis of the 
JUMP-means algorithm and compare its performance on 
synthetic and real datasets with standard inference methods 
for MJPs. For evaluation, we consider multiple sequences 
of discretely observed data and randomly hold out a subset 
of the data. We report reconstruction error for performance 
comparison. 


5.1. Parametric Models 


For the parametric models, we compare JUMP-means to 
maximum likelihood estimation of the MJP parameters 
learned by EM ( Asger & Ledet||2005| ), the MCMC method 
proposed in Rao & Teh ( 201 3[ > and a simple baseline where 
we ignore the sequential structure of the data. We run three 
sets of experiments (2 synthetic, 1 real) for our evaluation. 


Synthetic 1: Directly Observed States For evaluating the 
model on a directly observed process, we generate 100 dif¬ 
ferent datasets randomly from various MJPs with 10 states. 
To generate each dataset, we first generate the rows of 
the transition probability matrix and transition rates inde¬ 
pendently from Dir(l) and Gam(l, 1), respectively. Next, 
given the rates and transition probabilities for each dataset, 
we sample 500 sequences of length 20. We hold out 30% of 
the observations at random for testing reconstruction error. 


We run JUMP-means by initializing the algorithm with a 
uniform transition matrix P and set the rate vector A to 
be all ones. We run 300 iterations of the algorithm de¬ 
scribed in Section [33] each iteration is one scan through 
all the sequences. We set the hyperparameters £, £>,, and fix 
equal to 1, 1, and .5, respectively. For MCMC, we initialize 
the jump points using the time points of the observations. 
We place independent Dir(l) priors on P and independent 
Gam(l, 1) priors on A. We initialize EM with a uniform P 
and an all-ones A. We run both MCMC and EM for 300 it¬ 
erations, then reconstruct observations using the Bayes es¬ 
timator approximated from the 300 posterior samples. For 
our baseline we use the most common observation in the 
dataset as an estimate of the missing observations. 

Table [T] gives the mean reconstruction error across se¬ 
quences for the various methods. Note that JUMP-means 
performs better than MCMC, and is almost on par with EM. 
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Figure 2. Mean error vs iterations for (a) Synthetic 1; (b) Synthetic 2; (c) MS; and (d) MIMIC datasets. In each case the JUMP-means 
algorithms have better or comparable performance to other standard methods of inference in MJPs. Mean error vs CPU runtime plots 
can be found in the Supplementary Material. 


Fig. |2ja) shows the average error across all the datasets for 
each method versus number of iterations. In terms of CPU 
time, each iteration of JUMP-means (Java), EM (Java), and 
MCMC (Python) takes 0.3, 1.61 and 42 seconds, respec¬ 
tively. We also ran experiments with the beam search vari¬ 
ant described in Section |3.3[ however, we did not obtain 
any significant improvement in results. 

Synthetic 2: Hidden States For the hidden state case, we 
generate 100 different datasets for MJPs with 5 hidden and 
5 observed states, with varying parameters as above. In 
each dataset there are 500 sequences of length 20. In addi¬ 
tion to parameters in the directly observed case, we gener¬ 
ate observation likelihood terms for each state from Dir(l). 

We initialize the transition probabilities and the rate vectors 
for JUMP-means, MCMC and EM in a fashion similar to 
the directly observed case. For the observation likelihood 
p , we use Dir(l) as a prior for MCMC, uniform distribu¬ 
tions for EM initialization and a uniform probability matrix 
with a small amount of random noise for JUMP-means ini¬ 
tialization. We set £>,, p\ as before and £ to 1. 




Weeks (JUMP-means) Weeks (ML) 



Figure 3. Top: Latent trajectories inferred by JUMP-means and 
ML estimate for a patient in the MS dataset. Bottom: Latent tra¬ 
jectory inferred by JUMP-means for a patient in MIMIC dataset. 


We run each algorithm for 300 iterations. For JUMP- 
means, we use the hidden state MJP algorithm described 
in Section |3.3| Table |T] and Fig. [2] b) again demonstrate 
that JUMP-means outperforms MCMC by a large margin 
and performs comparably to EM. The poor performance 
of MCMC is due to slow mixing over the parameters and 
state trajectories. The slow mixing is a result of the cou¬ 
pling between the latent states and the observations, which 
is induced by the observation likelihood. 

Disease Progression in Multiple Sclerosis (MS) Estimat¬ 
ing disease progression and change points in patients with 
Multiple Sclerosis (MS) is an active research area (see, 
e.g., |Mandel| ( |2010| )). We can cast the progression of the 
disease in a single patient as an MJP, with different states 
representing the various stages of the disease. Obtaining 
the most-likely trajectory for this MJP can aid in under¬ 
standing the disease progression and enable better care. 

For our experiments, we use a real-world dataset collected 
from a phase III clinical trial of a drug for MS. This dataset 


tracks the progression of the disease for 72 different pa¬ 
tients over three years. We randomly hold out 50% of the 
observations and evaluate on the observation reconstruction 
task. The observations are values of a disability measure 
known as EDSS, recorded at different time points. Initial¬ 
ization and hyperparameters are the same as Synthetic 1. 

Table [T] shows that JUMP-means significantly outperforms 
MCMC, achieving almost a 50% relative reduction in re¬ 
construction error. JUMP-means again achieves compara¬ 
ble results with EM. Fig. [3](top panel) provides an example 
of the latent trajectories from JUMP-means and maximum 
likelihood estimate for a single patient. The MLE trajectory 
includes two infinitesimal dwell times, which do not reflect 
realistic behavior of the system (since we do not expect a 
patient to be in a disease state for an infinitesimal amount 
of time). On the other hand, the trajectory produced by 
JUMP-means takes into account the dwell times of the var¬ 
ious stages of the disease and provides a more reasonable 
picture of its progression. 
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Figure 4. Histograms of error reconstruction for runs with differ¬ 
ent hyperparameter settings for (a) MS (P-DO, 48 settings), and 
(b) MIMIC (NPB-H, 1125 settings) datasets. 

5.2. Nonparametric Model 



Figure 5. Runtime and error of nonparametric JUMP-means al¬ 
gorithm with increasing synthetic data size. The runtime scales 
linearly with data size (dashed black line). 


Vital Signs Monitoring (MIMIC) We now consider a ver¬ 
sion of the problem of understanding physiological sig¬ 
nals discussed in the introduction. We use data from the 
MIMIC database ( |Goldberger et al.[|2000||Moody & Mark| 


19961, which contains recordings of several vital parame¬ 


ters of ICU patients. Specifically, we consider blood pres¬ 
sure readings of 69 ICU patients collected over a 24-hour 
period and sub-sample observation sequences of length 32 
for each patient, keeping the start and end times fixedj^For 
testing, we randomly hold out ~25% of the observations. 


To initialize JUMP-means, we choose uniform matrices for 
p, fro and 7 t i and set M = I. The hyperparameters 7 and 
£1 are set to 5, while and £2 are set to 0.005. Us¬ 
ing a Gaussian likelihood model for the observations, we 
run our model for 50 iterations. We compare with par¬ 
ticle MCMC (PMCMC) ( |Andrieu et ah] |2UT0) l and EM. 
PMCMC is a state-of-the-art inference method for iMJPs 
( jSaeedi & Bouchard-Cote| |201 lj i, which we run for 300 
iterations with 100 particles. For PMCMC, we first catego¬ 
rize the readings into the standard four categories for blood 
pressure provided by NIF0 We run EM with a number of 
hidden states from 1 to 12 and report the best performance 
among all the results. For initializing the EM, we use the 
same setting as the Synthetic 2 case. 

For evaluation, we consider the time point of a test obser¬ 
vation and categorize the mean of the latent state at this 
time point (using the same categories obtained above) to 
compare against the actual category. Table [T| shows that 
JUMP-means significantly outperforms PMCMC and ob¬ 
tains a 21% relative reduction in average error rate. Fig. 
[2fc) plots the error against iterations of both algorithms. In 
terms of CPU time, each iteration of JUMP-means (Java) 
and PMCMC (Java) takes 0.17 and 1.95 seconds, respec¬ 
tively. Compared to EM’s error rate of 25.7%, JUMP- 
means reaches a rate of 24.3% without the need to sepa¬ 
rately train for different number of states. The second-best 

2 We use a small dataset for testing since PMCMC cannot eas¬ 
ily scale to larger datasets. 

' http://www.nhlbi.nih.gov/health/ 
health-topics/topics/hbp 


result for the EM had an error of 45%, which shows the 
importance of model selection when using EM. 

Fig. [3](bottom) provides an example of the latent trajectory 
inferred by JUMP-means. The observations are uniquely 
colored by the latent state they are assigned. We note that 
the model captures different levels of blood pressure read¬ 
ings and provides a non-degenerate latent trajectory. 

Hyperparameters A well-known problem when applying 
SVA methods is that there are a number of hyperparame¬ 
ters to tune. In our objective functions, some of these hy¬ 
perparameters ( 7 , fi\, and £\) have natural interpretations 
so prior knowledge and common sense can be used to set 
them, but others do not. Fig. [4] shows histograms over the 
errors we obtain for runs of JUMP-means on the MS and 
MIMIC datasets with different settings. We can see that 
a significant fraction of the runs converge to the minimum 
error, while some settings — in particular when the hyper¬ 
parameters were of different orders of magnitude — led to 
larger errors. Hence, the sensitivity study indicates the ro¬ 
bustness of JUMP-means to the choice of hyperparameters. 

Scaling Fig. [5] shows the total runtime and reconstruction 
error of the non-parametric JUMP-means algorithm on in¬ 
creasingly large amounts of synthetic data. The algorithm 
is able to handle up to a million data points with the run¬ 
time scaling linearly with data size. Furthermore, the error 
rate decreases significantly as the amount of data increases. 
See the Supplementary Material for further details. 

6. Conclusion 

We have presented JUMP-means, a new approach to in¬ 
ference in MJPs using small-variance asymptotics. We de¬ 
rived novel objective functions for parametric and Bayesian 
nonparametric models and proposed efficient algorithms to 
optimize them. Our experiments demonstrate that JUMP- 
means can be used to obtain high-quality non-degenerate 
estimates of the latent trajectories. JUMP-means offers at¬ 
tractive speed-accuracy tradeoffs for both parametric and 
nonparametric problems, and achieved state-of-the-art re¬ 
construction accuracy on nonparametric problems. 




















JUMP-Means: SVA for MJPs 


Acknowledgments 

Thanks to Monir Hajiaghayi, Matthew Johnson, and Te- 
jas Kulkarni for helpful comments and discussions. JHH 
was supported by the U.S. Government under FA9550-11- 
C-0028 and awarded by the DoD, Air Force Office of Sci¬ 
entific Research, National Defense Science and Engineer¬ 
ing Graduate (NDSEG) Fellowship, 32 CFR 168a. 

References 

Andrieu, C., Doucet, A., and Holenstein, R. Particle 
Markov chain Monte Carlo methods. Journal of the 
Royal Statistical Society: Series B (Statistical Method¬ 
ology) , 72(3):269-342, 2010. 

Asger, H. and Ledet, J. J. Statistical inference in evolu¬ 
tionary models of dna sequences via the em algorithm. 
Statistical Applications in Genetics and Molecular Biol¬ 
ogy, 4(1): 1-22, 2005. 

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 
Clustering with bregman divergences. The Journal of 
Machine Learning Research, 6:1705-1749, 2005. 

Bladt, M. and S0 re risen, M. Statistical inference for dis¬ 
cretely observed markov jump processes. Journal of the 
Royal Statistical Society: Series B (Statistical Methodol¬ 
ogy), 67(3):395-410, 2005. 

Bladt, M. and Sprensen, M. Efficient estimation of tran¬ 
sition rates between credit ratings from observations at 
discrete time points. Quantitative Finance, 9(2): 147- 
160, 2009. 

Broderick, T., Kulis, B., and Jordan, M. I. MAD-Bayes: 
MAP-based Asymptotic Derivations from Bayes. In In¬ 
ternational Conference on Machine Learning, 2013. 

DLMF. NIST Digital Library of Mathematical Functions. 
http://dlmf.nist.gov/. Release 1.0.8 of 2014-04-25. URL 
http://dlmf.nist.gov/ Online companion to 
flOlver et al.|[20l0| . 

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, 
J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, 
G. B., Peng, C.-K., and Stanley, H. E. PhysioBank, 
PhysioToolkit, and PhysioNet: Components of a new re¬ 
search resource for complex physiologic signals. Circu¬ 
lation, 101(23):e215-e220, 2000. 

Hajiaghayi, M., Kirkpatrick, B., Wang, L., and Bouchard- 
Cote, A. Efficient Continuous-Time Markov Chain Esti¬ 
mation. In International Conference on Machine Learn¬ 
ing, 2014. 

Jiang, K., Kulis, B., and Jordan, M. I. Small-variance 
asymptotics for exponential family dirichlet process 
mixture models. In NIPS, pp. 3167-3175, 2012. 


Kingman, J. F. C. Poisson Processes. Oxford Studies in 
Probability. Oxford University Press, 1993. 

Kulis, B. and Jordan, M. I. Revisiting k-means: New Al¬ 
gorithms via Bayesian Nonparametrics. In International 
Conference on Machine Learning, 2012. 

Lange, J. Latent Continuous Time Markov Chains for 
Partially-Observed Multistate Disease Processes. PhD 
thesis, 2014. 


Mandel, M. Estimating disease progression using panel 
data. Biostatistics, 11(2):304-316, 2010. 


Moody, G. and Mark, R. A database to support develop¬ 
ment and evaluation of intelligent intensive care moni¬ 
toring. In Computers in Cardiology, 1996, pp. 657-660, 
Sept 1996. 


Olver, F. W. J., Lozier, D. W., Boisvert, R. F., and Clark, 
C. W. (eds.). NIST Handbook of Mathematical Func¬ 
tions. Cambridge University Press, New York, NY, 2010. 
Print companion to (DLMF|>. 


Perkins, T. J. Maximum likelihood trajectories for 
continuous-time markov chains. In NIPS, pp. 1437- 
1445, 2009. 


Rao, V. and Teh, Y. W. Fast MCMC sampling for Markov 
jump processes and extensions. The Journal of Machine 
Learning Research, 14( 1 ):3295—3320, 2013. 

Roychowdhury, A., Jiang, K., and Kulis, B. Small-Variance 
Asymptotics for Hidden Markov Models. In Advances in 
Neural Information Processing Systems, pp. 2103-2111, 
2013. 


Saeedi, A. and Bouchard-Cote, A. Priors over recurrent 
continuous time processes. In NIPS, volume 24, pp. 
2052-2060,2011. 

Teh, Y. W„ Jordan, M. I., Beal, M. J., and Blei, D. M. Hi¬ 
erarchical Dirichlet Processes. Journal of the American 
Statistical Association, 101(476): 1566—1581, December 
2006. 








Supplementary Material for 
JUMP-Means: Small-Variance Asymptotics for 
Markov Jump Processes 


A. Parametric MJPs for SVA 


To obtain the SVA objective from the parametric MJP model, we begin by scaling the exponential distribution /(f; A) = 
Aexp(—At), which is an exponential family distribution with natural parameter 77 = —A, log-partition function w(ji) = 
— ln(— 77 ), and base measure v(dt) = 1 (Banerjee et al.J 2005). To scale the distribution, introduce the new natural 
parameter = /3r] and log-partition function ip(fj) = f3ip(fj//3). The new base measure i>(df) is uniquely defined by the 
integral equation (see Banerjee et alT)[2005| Theorem 5) 


exp(fjt)i>(dt) = exp^rj)) = exp(-pin(fj/P)) = ^. 


Choosing F(df) = d( satisfies the condition, so we have 

fit] A,/?) = = exp(-/3Af + (/3 - 1 ) Inf + /31n A/3 — lnT(/?)) 


= exp < —f3 Xt — In t — In A — 


/31n/3 — lnT(/3) Inf A 


P 


P J 


It can now be seen that f(t ; A, ff) is the density of a gamma distribution with shape parameter /3 and rate parameter ;3\. 
Hence, the mean of the scaled distribution is j and its variance is jo. Letting F(t: A, 5) denote the CDF corresponding to 

f{t] A, p), we have 1 — F{t\ A, p) = , where F(-, •) is the upper incomplete gamma function. 

For the state at the fc-th jump we use a 1-of-M representation; that is, is an M-dimensional binary random variable 
which satisfies Skm £ {0,1} and J2m =1 s fcm = 1. Hence, we have: 


M 


p(sk\sk~i,j = 1) = XT 


Skm 


(A.l) 


Given the Bregman divergence for a multinomial distribution, d^fsk-p-) = KL(sfc||p ? ) where = {pji ,... this 

can be written in terms of exponential family notation in the following form (Banerjee et al. 2005)1: 


p{s k \s k -i,j = 1 ) = b^isk) exp(—d^,(sfe,p )) 


(A.2) 


where b^{sk) = 1. For a scaled multinomial distribution we have exp (—pd^isk.Pj)), where 0 = t;p is the scaling 

parameter for the multinomial distribution. Writing the trajectory probility with the scaled exponential families yields: 


k- 1 


nji v> w ) « / lnr (/3) - lnr(/3,/3A Sjf f.) r , n ^ 

p(U\s 0 ,SK,P,S) <x exp j-/3 I - - -h £ KL(s fc+ i||p s J 


E U* - ■» W* - TEAzJTM +¥ ))}, 


(A.3) 
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Since f3 —> oo, we can apply the asymptotic expansions for [''(■) and F(-, •). In particular, applying Stirling’s formula and 
the facts in ( jDLMFj ) we have: 

/^ln/3 — lnT(/3) /31n/3 — /31n/3 + f3 + o(/3) 

0 = 0 * 

Inr(/3) — Inr(/3, /3Xt) \ ^ At - In At - 1 if t > i 

— | P ln/3-/3-;31n/3+/3+o(/3) q jp ^ 1 

We also place a Gam(a A , cexpx) prior on each A*. With ax = £ A /3, we obtain 

lnp(A s | a x , a x p x ) = a x ln(a x p x ) + (a A - 1) In A s - In T(a A ) - a x p x X s 
= £\(3 In A s - £ x p x 0X s + £ A /3 + o{0) 

= -0{£, x p x X s - In A s - 1) + o(/3). 

Hence, when (3 —> oo, obtain 

^ K-l K-l 

jnin j £ X! KL ( s fc+illP s J + Y1 ( Xs * tk ~ hlX ^ tk ~ X ) 

’ ’ ^ k =0 /c—0 

M 

+ 1[A SK t. > l](X SK t. - In A SK t. - 1) + £ a ^(^ a A s - lnA s - 1) 

S=1 

B. Bayesian Nonparametric MJPs for SVA 

First we recall that the Moran gamma process is a distribution over measures. If p ~ FP(7/, 7 ) is a random measure 
distributed according to a Moran gamma process with base measure H on the probability space (17, T) and rate parameter 


7 , then for all measurable partitions of 17, {A \,..., At), p satisfies 

(KAi), ■ ■ .,p{A ( )) ~ Gam(fJ(Ai), 7) x • • • x Gam(Ff(A«), 7). (B.l) 

The hierarchical gamma-gamma process (HTTP) is defined to be: 

/To~rP(a 0 ifo,7o) (B.2) 

Mi I Mo rP(^o,7) * = 1,2,... (B.3) 

Sfc I {MSo.Wfc-!- (B.4) 

tk I {M£o>£4-i ~ Gam(/3, ||/x Sfc _ 1 1|). (B.5) 

Consider the gamma-gamma process (rrP), defined by ( |B.3[ i-( pO] ) (with p 0 treated as an arbitrary fixed measure). We now 
show that the TTP retains the key properties of the PEP: conjugacy and exchangeability. Let 7’, = = 

and L) A ]T \ =1 l[sj_i = A] <5. S j be the sufficient statistics of the observations. 

Proposition B.l. The TTP is a conjugate family: pi \ Uk ~ TP{f3p ' i ,7 where Pi — Po + T an d 7 ' = 7 + T). 


Proof sketch. The proof is analogous to that for Proposition 2 in ( jSaeedi & Bouchard-Cote||2011| ). The key additional 
insight is that X ~ Gam(/la, b) and Y \ X ~ Gam(/3, X) are conjugate: X | y ~ Gam(/3(a + 1), b + Y). □ 

In order to give the joint distribution of the times T = Tk — (ft, ■ • •, tjr), we first derive the predictive distribution for the 
rrP, (sfc + i, tjfc+i) | Uk- We make use of the following family of densities. 

Definition B.2 (Shaped Translated Pareto). Let f3 > 0, a > 0, 7 > 0. A random variable S is shaped translated Pareto , 
denoted S ~ STP(/3, a, 7), if it has density 

fP—l 

ftp) = _ - _ - _ 

JU B(P,a0) (t + 7 )( 1 +“)/ 3 ’ 

where B(a , b) = is the beta function. 
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Proposition B.3. The predictive distribution of the ITP is 

(^fc + l 5 ^fc + 1 ) \^k ^ Ps k ^ ^d~P(/?, 


Sfc II 7 !s k ) 


(B. 6 ) 


Proof. By Proposition |B.1| it suffices to show that if p ~ TP((3po, 7 ), s \ p ~ p, and t \ p ~ Gam(/3, \\p\\), then (s, t) 
p x STP(/3, kq,7 ). where kq = \\po\\- Letting x = ||/i||, the distribution of t is 


POO 

p(t) = P(t I x)p{x) da 
Jo 


f°° x^t^~ 1 e~ xt r yl 3K o^^o-ig-7^ 


ros) 


r(^Ko) 




T(P)T(Pko) Jo 


^(l+«o)-lg-(7+i)x da , = T 


pKOj-p-l 


dx 

r(/3(i + tto)) 


r(/f)r(/?K 0 ) ( 7 + t)W+«o)‘ 


□ 


We can now show that the process is exchangeable by exhibiting the joint distribution of waiting times: 

Proposition B.4. Lett* m = (t* ml ,...,t* mK J be the waiting times following state m. Then is an exchangeable sequence 
with joint distribution 


P(t * m ) 


r(/?(K 0 + K m )) (rfc w ) /3 ~ 1 

r(/3)^ (7 + Ef=iT mi )^ K «+^) 


(B.7) 


Proof sketch. Take the product of the predictive distributions of r m 1 ,..., r,„ k,„. ■ 


□ 


The measures {/ 7 }jL 0 and Hq can be integrated out of the HTTP generative model in a manner analogous to the way in 
the the Chinese restaurant franchise in obtained from the hierarchical Dirichlet process (|Teh et al.j |2006| >. However the 
mass of the measure po cannot be integrated out. We omit details as they are essentially identical to those in case of the 
HrEP ( |Saeedi & Bouchard^C5t6l|20TT] i. 

First, we consider the case of integrating out {/i,}i>o- Let M denote the number of used states, K rn the number of 
transitions out of state in, and r m the number of states that can be reached from state in in one step. The contribution to 
the likelihood from the HrrP prior is 


p{U, K 0 I P, 7o, 7; Ob) = p(k 0 I 7o)p(S I P, Kq)p{T \ P, 7 , «o) 


oc hi. 


«o-W-7o«o n M-l r(ao + !) 

0 r(a 0 + r.) 


M 


n y 


M 

n 


m=1 

T(P(ko + K m )) (nf=l hmjf - 1 


,_i r(/3tt 0 + i) 
r(/3«o + K m ) 


np) 


Krr 


(7 + Efr^* 


where r. = m r m . Taking the logarithm, using asymptotic expansions for the Gamma terms, and ignoring o(/3) terms 
yields 

M 

(a 0 - 1) In k 0 - 7 0 Ko + (M - l)lna 0 + Y {(r m - l)ln/t 0 + P(k 0 + K m )\n[/3(ti 0 + K m )]} 

m—1 

M 

Y { P(t^o + K m ) - K m [P In P- p\+p In t* mj - P(k 0 + K m ) In (7 + t* m .) j , 

m—1 

where t* rn . = Yif=1 Knj- order to retain the effects of the hyperparameters in the asymptotics, set «o = exp(—£i/3) and 
70 = exp(^ 2 ^)- Thus, «o —> 0 as p —> 00 . We require that lirnsup^^ K 070 < 00 , so without loss of generality we can 
choose kq = 7 q" = exp(— £?p) to obtain 


~P 


M 


mm- i) + Y 


m—1 



1 )-Ef=rlnf 


j = 1 


mj 


+ K m In ([7 + t* m . 
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Thus, the objective function to minimize is 

L M 

C 53 KL ( x e\\Ps Tl ) + & M + 53 {&( r m - 1) - Ef= m i ln *mj - K m In ([7 + CJ++)} . (B.8) 

i=\ m—1 


Alternatively, the small variance asymptotics can be derived in the case where {p+>o is not integrated out. To do so, we 
first rewrite the HTTP generative model in an equivalent form, with // () integrated out: 


7 To ~ GEM(«o) 


IB.9) 

«o ~ Gam(ao, 7 o) 


(B.10) 

7T 4 | 7T 0 DT(+ 0 7r 0 ), 

i = 1 , 2 ,-.. 

(B.l 1) 

Ki 1 7 T 0 Gam(/3, 7 ), 

i = 1 , 2 ,... 

IB.12) 

Sfc | {7I”i}j = l+fc — 1 ~ Tlsfc-i 


IB.13) 

tk | ~ Gam(/3, n Sk ). 


IB.14) 


For 0 < i < M, 1 < j < M, let 7 Tjj = 7 r,; ? and for 0 < i < M, let 7rj,M+i — 1 — Ej3i n ij- Integrating out {«+>i, the 
contribution to the likelihood from the HTTP prior is now 


p(Wk,ko,tt 1/3,70,7, a 0 ) 

= p(«o I at 0 ,7 o)p(7f 0 | a 0 )p(n 1:M \ Pk 0 tt 0 )p(S k \ n 1:M )p{T K I P, 7, «o) 


M 


oc K, 


“o- 1 „- 7 oko 
o e 


n 


7T0 i 


i-r-w. 


i =1 


K 


l,a 0 Dir( 7 r 4 |/ 3 k 0 7 To) ) p(Tk \ P, 7 . «o) 


OC AC, 


a o —1 —7o^o 
0 e 


M I 

TT 

fr(l + a 0 ) / 

H 1 

i= 1 1 

[ r(ao) 


■> 2 — 1 

-3=i 7r °J, 


r (/3«o) n 


k=1 
7T'- 


Q ° 1 M+l -/3 k 0 7T0j-1 ^ 


L = i r(/3tc 0 7r 0j ) 


) 




M 


x 11 x n 


n 

k=1 


P(£( Ko + Ff m )) 0 lf= m i 


(B.15) 

(B.16) 

(B.17) 


IB.18) 


=i T ^ Km (7 + E*+++ (k ° + ^' 


We use a slightly different limiting process, with 70 = = £ 2 * a positive constant, and scale the multinomial distributions 

( |B.13| > by /3£. Taking the logarithm and and ignoring o(/3) terms as before yields 


M 


M+l 


53 { ln «o + +0 1h/3ko -P+Yl {-^K 0 7r 0 ,i ln(/3K 0 7toj) + Pkottqj + Pkqttoj In77+ 
i =1 [ 3=1 

if M 

+ X7*; ln 7+-i, Sfc + 53 {Efr+inCi -/3 A mln([7 + +.]+lm)} 

/c=1 m—1 

m r M+i 

53 < -+£i + 53 {“+ 0 +J ln(7f 0ij ) + /3rt 0 7f 0 ,j lnify} 

i=i [ 3=1 

K M 

+ 53 ^ ln ^h-us k + 53 { Ef=”i P ln Knj - P K ™ In ([7 + CJ/-F+)} 

fc=l m—1 

(KM 

-ft S + ^53 ln 7f Sfe - 1>Sfe + 53 {6KL(7f 0 ||7f m ) - EjTjlnCi - ^mln([7 + C.]+im)} 


fc=l 


m—1 
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(a) 


(b) 




(c) (d) 

Figure C.6. Mean error vs CPU runtime for (a) Synthetic 1; (b) Synthetic 2; (c) MS; and (d) MIMIC datasets. In each case the JUMP- 
means algorithms have better or comparable performance to other standard methods of inference in MJPs. 


Thus, the objective function to minimize is 
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C. Time-accuracy plots for the experiments 

In the main paper we include the error versus iteration as it is more objective than time-accuracy results. In Fig. |C.6| we 
compare the time-accuracy across different methods for different datasets. EM, PMCMC, and JUMP-means are imple¬ 
mented in Java and MCMC is implemented in Python. To plot the MCMC results, we give a speed boost of lOOx in the 
results to compensate for Python’s slow interpreter. From our experience with scientific computing applications, we believe 
this is a generous adjustment. Also we note that the EM implementation used in our experiments is not the most optimized 
in terms of time per iteration. However, our goal is to show that JUMP-means can achieve comparable performance with a 
reasonable implementation of MCMC and EM. 


D. Scaling experiments 

For the scaling experiments we generated 4 datasets consisting of 10 2 to 10 s sequences. All datasets are sampled from 
a single hidden state MJP with 5 hidden states and 5 possible observations. For the 20 observations in each sequence a 
Gaussian likelihood is used. Finally, for the held out results, we categorized the observations in 5 bins, removed 30% of 
the data points and predicted their category. 











